data selection
==============

The key document here is [**build_genre_dataset.ipynb**](https://github.com/tedunderwood/genredistance/blob/master/select_data/build_genre_dataset.ipynb), which records the selection of volumes for the experiment and, toward the end of the notebook, the process of generating measures of social proximity for genres.

**scrape_marc.py** is just here to document the process that transformed MARC records into tabular metadata.
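
For readers unfamiliar with MARC, here is a minimal sketch of what flattening records into a table can look like. It is not the code in scrape_marc.py and may differ from it substantially. It assumes the records are available as MARCXML, and the column-to-field mapping (`FIELDS`), the function names, and the sample record are all hypothetical choices made for illustration.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Namespace used by MARCXML documents.
MARC_NS = {"marc": "http://www.loc.gov/MARC21/slim"}

# Hypothetical mapping from output column to (MARC tag, subfield code).
# 100$a = main author, 245$a = title, 260$c = date of publication.
FIELDS = {
    "author": ("100", "a"),
    "title": ("245", "a"),
    "date": ("260", "c"),
}


def records_to_rows(marcxml_text):
    """Yield one dict per <record>; missing fields become empty strings."""
    root = ET.fromstring(marcxml_text)
    for record in root.iter("{http://www.loc.gov/MARC21/slim}record"):
        row = {}
        for column, (tag, code) in FIELDS.items():
            path = f"marc:datafield[@tag='{tag}']/marc:subfield[@code='{code}']"
            node = record.find(path, MARC_NS)
            row[column] = (node.text or "").strip() if node is not None else ""
        yield row


def rows_to_tsv(rows):
    """Serialize rows as tab-separated text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=list(FIELDS), delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Made-up placeholder record, used only to exercise the functions above.
SAMPLE = """<collection xmlns="http://www.loc.gov/MARC21/slim">
  <record>
    <datafield tag="100" ind1="1" ind2=" ">
      <subfield code="a">Example, Author,</subfield>
    </datafield>
    <datafield tag="245" ind1="1" ind2="0">
      <subfield code="a">An example title /</subfield>
    </datafield>
  </record>
</collection>"""

if __name__ == "__main__":
    print(rows_to_tsv(records_to_rows(SAMPLE)), end="")
```

The sample record has no 260 field, so its `date` column comes out empty rather than raising an error. Real MARC data usually needs further cleanup, such as trailing punctuation and repeated fields, which this sketch leaves untouched.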

For more on the process of deduplication, see [the repository for NovelTM metadata on English-language fiction](https://github.com/tedunderwood/noveltmmeta).
